69 research outputs found
MCP: Self-supervised Pre-training for Personalized Chatbots with Multi-level Contrastive Sampling
Personalized chatbots aim to endow a chatbot with a consistent personality so
that it behaves like a real user and can further act as a personal assistant.
Previous studies have explored generating implicit user profiles from the
user's dialogue history for building personalized chatbots. However, these
studies train the entire model with only the response generation loss, which
makes them prone to data sparsity. Besides, they overemphasize the quality of
the final generated response while ignoring the correlations among the
utterances in the user's dialogue history, leading to coarse data
representations and degraded performance. To tackle these problems, we
propose a self-supervised learning framework MCP for capturing better
representations from users' dialogue history for personalized chatbots.
Specifically, we apply contrastive sampling methods to exploit the supervision
signals hidden in the user's dialogue history and to generate pre-training samples
for enhancing the model. We design three pre-training tasks based on three
types of contrastive pairs from user dialogue history, namely response pairs,
sequence augmentation pairs, and user pairs. We pre-train the utterance encoder
and the history encoder toward the contrastive objectives and use these
pre-trained encoders to generate user profiles during personalized response
generation. Experimental results on two real-world datasets show that our
proposed model MCP significantly outperforms existing methods.
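To make the contrastive objectives concrete, the sketch below shows an
InfoNCE-style loss over a batch of paired embeddings, a standard way to train
encoders on contrastive pairs such as the response, sequence augmentation, and
user pairs described above. The encoder interface, batch construction, and
temperature are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss over a batch of contrastive pairs.

    anchor, positive: (batch, dim) embeddings of the two views of each
    pair, e.g. two responses, an original and an augmented history
    sequence, or two histories from the same user.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature                      # pairwise similarities
    labels = torch.arange(a.size(0), device=a.device)   # diagonal entries are positives
    return F.cross_entropy(logits, labels)
```

Each of the three pre-training tasks would feed its own pair type through the
utterance or history encoder and minimize a loss of this form before the
response generation fine-tuning stage.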
JDsearch: A Personalized Product Search Dataset with Real Queries and Full Interactions
Recently, personalized product search has attracted great attention, and many
models have been proposed. To evaluate the effectiveness of these models,
previous studies mainly utilize the simulated Amazon recommendation dataset,
which contains automatically generated queries and excludes cold users and tail
products. We argue that evaluating with such a dataset may yield unreliable
results and conclusions, and deviate from real user satisfaction. To overcome
these problems, in this paper, we release a personalized product search dataset
comprised of real user queries and diverse user-product interaction types
(clicking, adding to cart, following, and purchasing) collected from JD.com, a
popular Chinese online shopping platform. More specifically, we sample about
170,000 active users on a specific date, then record all the products they
interacted with and the queries they issued over one year, without removing any
tail users or products. This results in roughly 12,000,000 products, 9,400,000 real
searches, and 26,000,000 user-product interactions. We study the
characteristics of this dataset from various perspectives and evaluate
representative personalization models to verify its feasibility. The dataset
can be publicly accessed on GitHub: https://github.com/rucliujn/JDsearch.
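As a rough illustration of how such a dataset might be consumed, the snippet
below tallies interaction types from a hypothetical JSON-lines export. The file
name and field names (`interactions`, `type`) are invented for illustration;
the actual schema in the JDsearch repository should be consulted.

```python
import json
from collections import Counter

def interaction_stats(path: str) -> Counter:
    """Count user-product interaction types in a JSON-lines dump.

    Assumes one record per user with an 'interactions' list whose items
    carry a 'type' field (e.g. click, add_to_cart, follow, purchase);
    this layout is hypothetical, not the dataset's documented schema.
    """
    counts: Counter = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            user = json.loads(line)
            for event in user.get("interactions", []):
                counts[event["type"]] += 1
    return counts

print(interaction_stats("jdsearch_users.jsonl"))
```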
Retrieve Anything To Augment Large Language Models
Large language models (LLMs) face significant challenges stemming from their
inherent limitations in knowledge, memory, alignment, and action. These
challenges cannot be addressed by LLMs alone; they call for assistance from
the external world, such as knowledge bases, memory stores, demonstration
examples, and tools. Retrieval augmentation is a vital mechanism for bridging
the gap between LLMs and this external assistance. However,
conventional methods encounter two pressing issues. On the one hand,
general-purpose retrievers are not properly optimized for the retrieval
augmentation of LLMs. On the other hand, task-specific retrievers lack the
versatility needed to perform well across diverse retrieval augmentation
scenarios.
In this work, we present a novel approach, the LLM-Embedder, which
comprehensively supports the diverse retrieval augmentation needs of LLMs with
one unified embedding model. Training such a unified model is non-trivial, as
various retrieval tasks aim to capture distinct semantic relationships, often
subject to mutual interference. To address this challenge, we systematically
optimize our training methodology. This includes reward formulation based on
LLMs' feedback, the stabilization of knowledge distillation, multi-task
fine-tuning with explicit instructions, and homogeneous in-batch negative
sampling. These optimization strategies contribute to the outstanding empirical
performance of the LLM-Embedder. Notably, it yields remarkable enhancements in
retrieval augmentation for LLMs, surpassing both general-purpose and
task-specific retrievers in various evaluation scenarios. Our checkpoint and
source code are publicly available at
https://github.com/FlagOpen/FlagEmbedding
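The homogeneous in-batch negative sampling mentioned above can be sketched as a
contrastive loss that masks out cross-task pairs, so examples from different
retrieval tasks never serve as each other's negatives. Everything in this
snippet (tensor shapes, temperature, masking scheme) is an illustrative
reconstruction, not the authors' released training code.

```python
import torch
import torch.nn.functional as F

def homogeneous_in_batch_loss(query_emb: torch.Tensor,
                              key_emb: torch.Tensor,
                              task_ids: torch.Tensor,
                              temperature: float = 0.02) -> torch.Tensor:
    """Contrastive loss with homogeneous in-batch negatives.

    query_emb, key_emb: (batch, dim) outputs of one shared embedder;
    task_ids: (batch,) integer task labels. Only same-task examples act
    as negatives, reducing interference between retrieval tasks.
    """
    q = F.normalize(query_emb, dim=-1)
    k = F.normalize(key_emb, dim=-1)
    logits = q @ k.T / temperature
    same_task = task_ids.unsqueeze(0) == task_ids.unsqueeze(1)
    logits = logits.masked_fill(~same_task, float("-inf"))  # drop cross-task negatives
    labels = torch.arange(q.size(0), device=q.device)       # i-th key matches i-th query
    return F.cross_entropy(logits, labels)
```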
Halothiobacillus neapolitanus Carboxysomes Sequester Heterologous and Chimeric RubisCO Species
Background: The carboxysome is a bacterial microcompartment that consists of a
polyhedral protein shell filled with ribulose 1,5-bisphosphate
carboxylase/oxygenase (RubisCO), the enzyme that catalyzes the first step of
CO2 fixation via the Calvin-Benson-Bassham cycle.
Methodology/Principal Findings: To analyze the role of RubisCO in carboxysome
biogenesis in vivo, we created a series of Halothiobacillus neapolitanus
RubisCO mutants. We identified the large subunit of the enzyme as an important
determinant for its sequestration into alpha-carboxysomes and found that the
carboxysomes of H. neapolitanus readily incorporate chimeric and heterologous
RubisCO species. Intriguingly, a mutant lacking carboxysomal RubisCO assembles
empty carboxysome shells of apparently normal shape and composition.
Conclusions/Significance: These results indicate that carboxysome shell
architecture is not determined by the enzyme the shells normally sequester. Our
study provides, for the first time, clear evidence that carboxysome contents
can be manipulated, and it suggests future nanotechnological applications based
upon engineered protein microcompartments.
RETA-LLM: A Retrieval-Augmented Large Language Model Toolkit
Although Large Language Models (LLMs) have demonstrated extraordinary
capabilities in many domains, they still have a tendency to hallucinate and
generate fictitious responses to user requests. This problem can be alleviated
by augmenting LLMs with information retrieval (IR) systems (also known as
retrieval-augmented LLMs). With this strategy, LLMs can generate more factual
responses to user input by using, as references, the relevant content that IR
systems retrieve from external corpora. In addition, by
incorporating external knowledge, retrieval-augmented LLMs can answer in-domain
questions that cannot be answered by solely relying on the world knowledge
stored in parameters. To support research in this area and facilitate the
development of retrieval-augmented LLM systems, we develop RETA-LLM, a
RETrieval-Augmented LLM toolkit. In RETA-LLM, we create a complete pipeline
to help researchers and users build their customized in-domain LLM-based
systems. Compared with previous retrieval-augmented LLM systems, RETA-LLM
provides more plug-and-play modules to support better interaction between IR
systems and LLMs, including request rewriting, document retrieval, passage
extraction, answer generation, and fact checking modules. Our toolkit is
publicly available at https://github.com/RUC-GSAI/YuLan-IR/tree/main/RETA-LLM.
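As a loose sketch of how the five modules named above could compose into one
pipeline, the toy functions below stand in for the toolkit's stages; none of
this mirrors RETA-LLM's actual API, and each body is a trivial placeholder
where the real module (or an LLM call) would go.

```python
def rewrite_request(history: list[str], request: str) -> str:
    # Request rewriting: fold recent conversational context into the query.
    return " ".join(history[-1:] + [request]) if history else request

def retrieve_documents(query: str, corpus: list[str]) -> list[str]:
    # Document retrieval: naive keyword overlap stands in for a real IR system.
    terms = set(query.lower().split())
    return [doc for doc in corpus if terms & set(doc.lower().split())]

def extract_passages(docs: list[str], limit: int = 3) -> list[str]:
    # Passage extraction: keep only the top few candidate documents.
    return docs[:limit]

def generate_answer(query: str, passages: list[str]) -> str:
    # Answer generation: an LLM call conditioned on the passages would go here.
    return f"Answer to {query!r} grounded in {len(passages)} passage(s)."

def check_facts(answer: str, passages: list[str]) -> str:
    # Fact checking: verify the generated answer against retrieved evidence.
    return answer if passages else "No supporting evidence found."

def answer_request(history: list[str], request: str, corpus: list[str]) -> str:
    query = rewrite_request(history, request)
    passages = extract_passages(retrieve_documents(query, corpus))
    return check_facts(generate_answer(query, passages), passages)
```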
Characterization of the chloroquine resistance transporter homologue in Toxoplasma gondii
Mutations in the Plasmodium falciparum chloroquine resistance transporter
(PfCRT) protein confer resistance to the antimalarial drug chloroquine. PfCRT
localizes to the parasite digestive vacuole, the site of chloroquine action,
where it mediates resistance by transporting chloroquine out of the digestive
vacuole. PfCRT belongs to a family of transporter proteins called the
chloroquine resistance transporter family. CRT family proteins are found
throughout the Apicomplexa, in some protists, and in plants. Despite the
importance of PfCRT in drug resistance, little is known about the evolution or
native function of CRT proteins. The apicomplexan parasite Toxoplasma gondii
contains one CRT family protein. We demonstrate that T. gondii CRT (TgCRT)
colocalizes with markers for the vacuolar (VAC) compartment in these parasites.
The TgCRT-containing VAC is a highly dynamic organelle, changing its morphology
and protein composition between intracellular and extracellular forms of the
parasite. Regulated knockdown of TgCRT expression resulted in a modest
reduction in parasite fitness and swelling of the VAC, indicating that TgCRT
contributes to parasite growth and VAC physiology. Together, our findings
provide new information on the role of CRT family proteins in apicomplexan
parasites.
- …